Data description
In this tutorial, we aim to obtain differentially expressed genes (DEGs) between two biological conditions in an integrated data. This integrated data set is generated by combining 10 microarray studies on control and TGFb-treated samples (for more information about the data sets, see paper A Transcriptional Program for Detecting TGFβ-Induced EMT in Cancer by Foroutan M et al, 2017.
First, we assess the presence of unwanted variation in the integrated data which combines differents studies and platforms. Second, we show how to use RUV4 to remove unwanted variation and detect DEGs comparing control to TGFb-treated samples. Finally, we assess whether or not running RUV4 method helps, where we include the analysis of unadjusted data.
library(ruv) ## for applying RUV methods
library(limma) ## for vennDiagram()
library(ggplot2) ## for data visualisation
The integrated data introduced above have been split into two data sets: dataset A and dataset B, each containing different studies, platforms and tissues. We will explore and normalise these two data sets separately in order to compare the results obtained by the normalisation method RUV4.
Read in the two integrated datasets.
datasetA<- read.table("expr_10data_datasetA.txt", header = T, sep = "\t")
datasetB<- read.table("expr_10data_datasetB.txt", header = T, sep = "\t")
Look at the each data sets A and B, then make a matrix for each these data sets where the row names are gene IDs.
kable(datasetA[1:4,1:4])
| EntrezID| D_Ctrl_R1| D_TGFb_R1| D_Ctrl_R2|
|--------:|---------:|---------:|---------:|
| 2| 4.182898| 4.234203| 4.271958|
| 9| 5.763352| 6.063185| 5.698005|
| 10| 4.467305| 4.427406| 4.403148|
| 12| 8.597186| 7.814175| 9.019141|
kable(datasetA[1:4,1:4])
| EntrezID| D_Ctrl_R1| D_TGFb_R1| D_Ctrl_R2|
|--------:|---------:|---------:|---------:|
| 2| 4.182898| 4.234203| 4.271958|
| 9| 5.763352| 6.063185| 5.698005|
| 10| 4.467305| 4.427406| 4.403148|
| 12| 8.597186| 7.814175| 9.019141|
mA<-datasetA[,2:dim(datasetA)[2]]
mB<-datasetB[,2:dim(datasetB)[2]]
row.names(mA)<- datasetA[,1]
row.names(mB)<- datasetB[,1]
mA<- as.matrix(mA)
mB<- as.matrix(mB)
Look at the information related to each dataset including the name of the studies, types of platform, treatment, and tissue:
info_datasetA <- read.table("info_10data_datasetA.txt", sep="\t", header=T)
kable(info_datasetA)
|samples |study |treatment |platform |tissue |
|:-----------|:----------|:---------|:-------------|:---------|
|D_Ctrl_R1 |Deshiere |Ctrl |Agilent |Breast |
|D_TGFb_R1 |Deshiere |TGFb |Agilent |Breast |
|D_Ctrl_R2 |Deshiere |Ctrl |Agilent |Breast |
|D_TGFb_R2 |Deshiere |TGFb |Agilent |Breast |
|Hes_Ctrl_R1 |Hesling |Ctrl |HG_U133_Plus2 |Breast |
|Hes_TGFb_R1 |Hesling |TGFb |HG_U133_Plus2 |Breast |
|Hes_Ctrl_R2 |Hesling |Ctrl |HG_U133_Plus2 |Breast |
|Hes_TGFb_R2 |Hesling |TGFb |HG_U133_Plus2 |Breast |
|Hil_Ctrl_R1 |Hills |Ctrl |Illumina |Kidney |
|Hil_Ctrl_R2 |Hills |Ctrl |Illumina |Kidney |
|Hil_Ctrl_R3 |Hills |Ctrl |Illumina |Kidney |
|Hil_TGFb_R1 |Hills |TGFb |Illumina |Kidney |
|Hil_TGFb_R2 |Hills |TGFb |Illumina |Kidney |
|Hil_TGFb_R3 |Hills |TGFb |Illumina |Kidney |
|M_Ctrl_R1 |Maupin |Ctrl |HG_U133_Plus2 |Pancrease |
|M_Ctrl_R2 |Maupin |Ctrl |HG_U133_Plus2 |Pancrease |
|M_Ctrl_R3 |Maupin |Ctrl |HG_U133_Plus2 |Pancrease |
|M_TGFb_R1 |Maupin |TGFb |HG_U133_Plus2 |Pancrease |
|M_TGFb_R2 |Maupin |TGFb |HG_U133_Plus2 |Pancrease |
|M_TGFb_R3 |Maupin |TGFb |HG_U133_Plus2 |Pancrease |
|K_Ctrl_R1 |Keshamouni |Ctrl |HG_U133_Plus2 |Lung |
|K_Ctrl_R2 |Keshamouni |Ctrl |HG_U133_Plus2 |Lung |
|K_Ctrl_R3 |Keshamouni |Ctrl |HG_U133_Plus2 |Lung |
|K_TGFb_R1 |Keshamouni |TGFb |HG_U133_Plus2 |Lung |
|K_TGFb_R2 |Keshamouni |TGFb |HG_U133_Plus2 |Lung |
|K_TGFb_R3 |Keshamouni |TGFb |HG_U133_Plus2 |Lung |
|S_A_Ctrl_R1 |Sun_A549 |Ctrl |HG_U133_Plus2 |Lung |
|S_A_Ctrl_R2 |Sun_A549 |Ctrl |HG_U133_Plus2 |Lung |
|S_A_Ctrl_R3 |Sun_A549 |Ctrl |HG_U133_Plus2 |Lung |
|S_A_TGFb_R1 |Sun_A549 |TGFb |HG_U133_Plus2 |Lung |
|S_A_TGFb_R2 |Sun_A549 |TGFb |HG_U133_Plus2 |Lung |
|S_A_TGFb_R3 |Sun_A549 |TGFb |HG_U133_Plus2 |Lung |
info_datasetB <- read.table("info_10data_datasetB.txt", sep="\t", header=T)
kable(info_datasetB)
|samples |study |treatment |platform |tissue |
|:-------------|:-------|:---------|:-------------|:------|
|S_HCC_Ctrl_R1 |Sun_HCC |Ctrl |HG_U133_Plus2 |Lung |
|S_HCC_Ctrl_R2 |Sun_HCC |Ctrl |HG_U133_Plus2 |Lung |
|S_HCC_Ctrl_R3 |Sun_HCC |Ctrl |HG_U133_Plus2 |Lung |
|S_HCC_TGFb_R1 |Sun_HCC |TGFb |HG_U133_Plus2 |Lung |
|S_HCC_TGFb_R2 |Sun_HCC |TGFb |HG_U133_Plus2 |Lung |
|S_HCC_TGFb_R3 |Sun_HCC |TGFb |HG_U133_Plus2 |Lung |
|S_NCI_Ctrl_R1 |Sun_NCI |Ctrl |HG_U133_Plus2 |Lung |
|S_NCI_Ctrl_R2 |Sun_NCI |Ctrl |HG_U133_Plus2 |Lung |
|S_NCI_Ctrl_R3 |Sun_NCI |Ctrl |HG_U133_Plus2 |Lung |
|S_NCI_TGFb_R1 |Sun_NCI |TGFb |HG_U133_Plus2 |Lung |
|S_NCI_TGFb_R2 |Sun_NCI |TGFb |HG_U133_Plus2 |Lung |
|S_NCI_TGFb_R3 |Sun_NCI |TGFb |HG_U133_Plus2 |Lung |
|T_Ctrl1_R1 |Taube |Ctrl |HT_HG_U133A |Breast |
|T_Ctrl1_R2 |Taube |Ctrl |HT_HG_U133A |Breast |
|T_Ctrl1_R3 |Taube |Ctrl |HT_HG_U133A |Breast |
|T_TGFb_R1 |Taube |TGFb |HT_HG_U133A |Breast |
|T_TGFb_R2 |Taube |TGFb |HT_HG_U133A |Breast |
|T_TGFb_R3 |Taube |TGFb |HT_HG_U133A |Breast |
|T_Ctrl2_R1 |Taube |Ctrl |HT_HG_U133A |Breast |
|T_Ctrl2_R2 |Taube |Ctrl |HT_HG_U133A |Breast |
|T_Ctrl2_R3 |Taube |Ctrl |HT_HG_U133A |Breast |
|W_Ctrl_R1 |Walsh |Ctrl |HG_U133A_2 |Kidney |
|W_Ctrl_R2 |Walsh |Ctrl |HG_U133A_2 |Kidney |
|W_Ctrl_R3 |Walsh |Ctrl |HG_U133A_2 |Kidney |
|W_TGFb_R1 |Walsh |TGFb |HG_U133A_2 |Kidney |
|W_TGFb_R2 |Walsh |TGFb |HG_U133A_2 |Kidney |
|W_TGFb_R3 |Walsh |TGFb |HG_U133A_2 |Kidney |
Assessment of unwanted variation in the data
Here we perform some exploratory analyses on the integrated data sets A and B to assess the presence of unwanted variation in each dataset.
RLE plot
We start by looking at the RLE plots in dataset A, coloured by study, platform and tissue. In a “good” RLE plot, the median of all boxplots (i.e. samples) are around zero, while high deviations from zero show the presence of unwanted variation. For more information about RLE plot see paper RLE Plots: Visualising Unwanted Variation in High Dimensional Data, by Luke C. Gandolfo, Terence P. Speed, 2017.
In the current example, it’s important to note that in some cases, different studies are confounded with different platforms and tissues, and therefore there is no way to identify how much of the unwanted variation come from each of these factors. Such situations must be avoided when designing an experiment, however, it often happens when we perform data integration. For the purpose of this tutorial, we only consider “study” as the source of unwanted variation.
## Transpose the expression matrix so that we have genes in columns and dataset in rows
YA <- t(mA)
## Plot RLE coloured by study
ruv_rle(YA, info_datasetA$study, ylim = c(-5,5))

## Plot RLE coloured by platform
ruv_rle(YA, info_datasetA$platform, ylim = c(-5,5))

## Plot RLE coloured by platform
ruv_rle(YA, info_datasetA$tissue, ylim = c(-5,5))

Similarly, we look at the RLE plots in dataset B coloured by study, platform and tissue:
YB <- t(mB)
## Plot RLE coloured by study
ruv_rle(YB, info_datasetB$study, ylim = c(-5,5))

## Plot RLE coloured by platform
ruv_rle(YB, info_datasetB$platform, ylim = c(-5,5))

## Plot RLE coloured by tissue
ruv_rle(YB, info_datasetB$tissue, ylim = c(-5,5))

PCA plot
In transcriptomics applications, one of the most utilized exploratory plots is the multi-dimensional scaling (MDS) plot or a principal component analysis (PCA) plot. To assess the presence of unwanted variation in each dataset, we use PCA plots to show similarities between samples measured in an unsupervised way. Ideally, samples should be clustered together according to the treatment (i.e. the biological factor of interest). Here, we see that samples are rather clustered by studies (i.e. unwanted variation) in both data sets A and B.
gg_additions <- list(aes(color = info_datasetA$study,
shape = info_datasetA$treatment,
size = 5, alpha = .7),
labs(color = "Study", shape="Treatment"),
scale_size_identity(guide = "none"),
scale_alpha(guide = "none"),
theme(legend.text = element_text(size = 12),
legend.title = element_text(size = 16)),
guides(color = guide_legend(override.aes = list(size = 4)),
shape = guide_legend(override.aes = list(size = 4))))
options(repr.plot.width = 8, repr.plot.height = 6)
ruv_svdplot(YA) + gg_additions

gg_additions <- list(aes(color = info_datasetB$study,
shape = info_datasetB$treatment,
size = 5, alpha = .7),
labs(color = "Study",shape = "Treatment"),
scale_size_identity(guide = "none"),
scale_alpha(guide = "none"),
theme(legend.text = element_text(size = 12),
legend.title = element_text(size = 16)),
guides(color = guide_legend(override.aes = list(size = 4)),
shape = guide_legend(override.aes = list(size = 4))))
options(repr.plot.width=8, repr.plot.height=6)
ruv_svdplot(YB) + gg_additions #

Remove batch effects using RUV-4
There are several RUV methods for removing unwanted variation in order to obtain DEGs, including RUV-2, RUV-4, RUV-inv and RUV-rinv. In general, RUV methods are dependent on negative control genes (genes which are not associated with the biological factor of interest) and replicate samples (if applicable). RUV-2 removes unwanted variation in two steps. RUV-4 was introduced after RUV-2 and has four steps. For RUV-4 we can select different values for dimension of unwanted variation (k), while for RUV-inv and RUV-rinv we don’t need to change k as it is set to be the maximum value. RUV-inv is recommended when we have large number of control genes (~1000), while RUV-rinv is more appropriate with small number of control genes (~60).
Selection of negative control genes is very important. Examples of negative control genes are the spike-in controls or the housekeeping (HK) genes. It is also possible to define empirical negative control genes using an iterative approach. However, it is only recomended if (i) the initial negative control genes are not very good or there are only a few of them; (ii) the beta seems to be very sparse. Therefore, the user needs to generate diagnostic plots to assess the performance of the initial analysis, and only if needed, use an iterative approach to define better control genes. For this tutorial, we use HK genes as negative control genes.
RUV4
RUV4 can be run with different values of k in an attempt to find the optimal value. Note that although there is a function to estimate k, called getK(), this may not give the optimal value for k and is often recomended to be used when there is no other choice but to automate finding k (e.g. in simulations).
We begin with the list of housekeeping (HK) genes as our negative controls.
HKgenes <- read.table("HouseKeeping_genes_IDs.txt", header=T, sep="\t")
hk <- HKgenes$GeneID
ctrl <- colnames(YA) %in% hk
head(ctrl)
[1] FALSE FALSE FALSE FALSE FALSE TRUE
We also define our biological factor of interest (i.e. control versus TGFb treatment)
## Take treatment as the biological factor of interest in data set A
groups_A <- factor(info_datasetA$treatment)
gA <- cbind(as.numeric(groups_A)) ## 1 control, 2: TGFb
## Take treatment as the biological factor of interest in data set B
groups_B <- factor(info_datasetB$treatment)
gB <- cbind(as.numeric(groups_B)) ## 1 control, 2: TGFb
Unadjusted data
RUV4 with k = 0 performs no adjustment when obtaining DEGs, we do this first to see what would be the results if we did not adjust for unwanted variation.
# RUV4 with k = 0 for no adjustment
#----- In dataset A data
fit_unadj_hk_datasetA <- RUV4(YA, X = gA,
ctl = ctrl,
k = 0)
fit_unadj_hk_datasetA.summary <- ruv_summary(YA,
fit_unadj_hk_datasetA,
info_datasetA)
##----- In dataset B data
fit_unadj_hk_datasetB <- RUV4(YB, X = gB,
ctl = ctrl,
k = 0)
fit_unadj_hk_datasetB.summary <- ruv_summary(YB,
fit_unadj_hk_datasetB,
info_datasetB)
Look at the consistency between the results from data sets A and B when no adjustment is perfromed. Here, we look at the correlation between betahats obtained in the two data sets.
##----- Unadjusted data sets
plot(fit_unadj_hk_datasetA$betahat,
fit_unadj_hk_datasetB$betahat,
xlab = "Betahat dataset A",
ylab = "Betahat dataset B",
main = "Unadjusted",
xlim = c(-3,3), cex = 0.3, ylim = c(-4,4))
corVal <- cor.test(fit_unadj_hk_datasetA$betahat, fit_unadj_hk_datasetB$betahat)$estimate
text(-3,3, pos = 4, paste("Correlation: ", round(corVal,2), sep = ""))

RUV4-adjusted data with different values of k
## Take treatment as the biological factor of interest in data set A
groups_A <- factor(info_datasetA$treatment)
gA <- cbind(as.numeric(groups_A)) ## 1 control, 2: TGFb
## Take treatment as the biological factor of interest in data set B
groups_B <- factor(info_datasetB$treatment)
gB <- cbind(as.numeric(groups_B)) ## 1 control, 2: TGFb
## Select different values of k
ks <- c( 1, 2, 5, 6, 7, 8, 10, 11, 12, 15, 18, 20, 22, 23, 24)
## Define empty vector and lists and use them in the for loop to store the results
# beta_corAB_HK <- vector()
fit_ruv4_hk_datasetA_all_k <- list()
fit_ruv4_hk_datasetA_all_k.summary <- list()
fit_ruv4_hk_datasetB_all_k <- list()
fit_ruv4_hk_datasetB_all_k.summary <- list()
for (K in ks){
fit_ruv4_hk_datasetA_all_k[[K]] <- RUV4(YA, X = gA,
ctl = ctrl,
k = K, Z = 1, eta = NULL,
fullW0 = NULL, inputcheck = TRUE)
fit_ruv4_hk_datasetA_all_k.summary[[K]] <- ruv_summary(YA,
fit_ruv4_hk_datasetA_all_k[[K]],
info_datasetA)
fit_ruv4_hk_datasetB_all_k[[K]] <- RUV4(YB, X = gB,
ctl = ctrl,
k = K,Z = 1, eta = NULL,
fullW0 = NULL, inputcheck = TRUE)
fit_ruv4_hk_datasetB_all_k.summary[[K]] <- ruv_summary(YB,
fit_ruv4_hk_datasetB_all_k[[K]],
info_datasetB)
# currentCor <- cor.test(fit_ruv4_hk_datasetA_all_k[[K]]$betahat,
# fit_ruv4_hk_datasetB_all_k[[K]]$betahat)$estimate
# beta_corAB_HK <- c(beta_corAB_HK, currentCor)
plot(fit_ruv4_hk_datasetA_all_k[[K]]$betahat,
fit_ruv4_hk_datasetB_all_k[[K]]$betahat,
xlab = "Betahat dataset A",
ylab = "Betahat dataset B",
main = paste("RUV4_K_", K, sep=""),
xlim = c(-3,3), cex = 0.3, ylim = c(-4,4))
#abline(fit_ruv4_emp_datasetB$betahat,fit_ruv4_emp_datasetA$betahat)
corVal <- cor.test(fit_ruv4_hk_datasetA_all_k[[K]]$betahat,
fit_ruv4_hk_datasetB_all_k[[K]]$betahat)$estimate
text(-3,3, pos=4, paste("Correlation: ", round(corVal,2),sep=""))
}















# names(beta_corAB_HK) <- ks
# data.frame(beta_corAB_HK) ## K = 23 seems a good choice
P-value distributions for unadjusted and RUV4-adjusted data sets
Here we compare the results obtained from unajusted and RUV4- adjusted data using p-value distributions.
## We select K = 23
K = 23
##----- In data A
ruv_hist(fit_unadj_hk_datasetA.summary) + ggtitle("Unadj_A")

ruv_hist(fit_ruv4_hk_datasetA_all_k.summary[[K]]) + ggtitle("RUV4_HK_A")

## ---- In data B
ruv_hist(fit_unadj_hk_datasetB.summary) + ggtitle("Unadj_B")

ruv_hist(fit_ruv4_hk_datasetB_all_k.summary[[K]]) + ggtitle("RUV4_HK_B")

Overlapping DEGs between data sets A and B
Finally, for each of the unadjusted and RUV4-adjusted data sets, we look at the number of overlapping differentially expressed genes (DEGs) between the two data sets A and B.
First, we define DEGs as genes with adjusted p-value < 0.05 and |logFC| > 1 in data sets A and B which are unadjusted, RUV4-adjusted with K = 23, and RUV-4 adjusted with different values of K. In the next step, we look at the overlapping genes.
DEGs in the unadjusted data sets
Venn diagram comparing DEGs obtained from the unadjusted data sets A and B.
##------ Define DEGs in the unadjusted data set A
DEGsUnadj_datasetA <- row.names(fit_unadj_hk_datasetA.summary$C)[fit_unadj_hk_datasetA.summary$C$F.p.BH < 0.05 & abs(fit_unadj_hk_datasetA.summary$C$b_X1) > 1]
##------ Define DEGs in the unadjusted data set B
DEGsUnadj_datasetB <- row.names(fit_unadj_hk_datasetB.summary$C)[fit_unadj_hk_datasetB.summary$C$F.p.BH < 0.05 & abs(fit_unadj_hk_datasetB.summary$C$b_X1) > 1]
##------ Explore how many genes are overlapping between data sets A and B
allDEGs_unadj <- c(DEGsUnadj_datasetA,
DEGsUnadj_datasetB)
## remove duplicated gene symbols:
allDEGs_unadj <- allDEGs_unadj[!duplicated(allDEGs_unadj)]
## Draw a Venn diagram comparing DEGs for RUVinv
Counts_unadj <- matrix(0, nrow = length(allDEGs_unadj), ncol = 2)
row.names(Counts_unadj)<- allDEGs_unadj
colnames(Counts_unadj)<- c("Unadj_A","Unadj_B")
for( i in 1:length(allDEGs_unadj)) {
Counts_unadj[i,1]<- allDEGs_unadj[i] %in% DEGsUnadj_datasetA
Counts_unadj[i,2]<- allDEGs_unadj[i] %in% DEGsUnadj_datasetB
}
col <- c("blue", "violet")
vennDiagram(vennCounts(Counts_unadj),
circle.col = col,
cex = c(1.6, 1.2, 1), lwd=2)

DEGs in the RUV4-adjusted data sets for different values of k
Venn diagram comparing the RUV4-adjusted data sets A and B using different ks.
##---- Define DEGs for different values of k in data set A
DEGsRUV4_datasetA_all_k <- list()
for (K in ks){
DEGsRUV4_datasetA_all_k[[K]] <- row.names(fit_ruv4_hk_datasetA_all_k.summary[[K]]$C)[fit_ruv4_hk_datasetA_all_k.summary[[K]]$C$F.p.BH < 0.05 & abs(fit_ruv4_hk_datasetA_all_k.summary[[K]]$C$b_X1) > 1]
}
##---- Define DEGs for different values of k in data set B
DEGsRUV4_datasetB_all_k <- list()
for (K in ks){
DEGsRUV4_datasetB_all_k[[K]] <- row.names(fit_ruv4_hk_datasetB_all_k.summary[[K]]$C)[fit_ruv4_hk_datasetB_all_k.summary[[K]]$C$F.p.BH < 0.05 & abs(fit_ruv4_hk_datasetB_all_k.summary[[K]]$C$b_X1) > 1]
}
##---- Generate Venn diagrams, looking at the overlapping genes
## between data sets A and B for different values of k
allDEGs_RUV4_all_k <- list()
for (K in ks){
allDEGs_RUV4_all_k[[K]] <- c(DEGsRUV4_datasetA_all_k[[K]],
DEGsRUV4_datasetB_all_k[[K]])
## remove duplicated gene symbols:
allDEGs_RUV4_all_k[[K]] <- allDEGs_RUV4_all_k[[K]][!duplicated(allDEGs_RUV4_all_k[[K]])]
## Draw a Venn diagram comparing DEGs for RUV4
Counts_RUV4<- matrix(0, nrow = length(allDEGs_RUV4_all_k[[K]]), ncol=2)
row.names(Counts_RUV4) <- allDEGs_RUV4_all_k[[K]]
colnames(Counts_RUV4) <- c(paste("RUV4_A_K_", K, sep=""), paste("RUV4_B_K_",K,sep=""))
for( i in 1:length(allDEGs_RUV4_all_k[[K]])) {
Counts_RUV4[i,1]<- allDEGs_RUV4_all_k[[K]][i] %in% DEGsRUV4_datasetA_all_k[[K]]
Counts_RUV4[i,2]<- allDEGs_RUV4_all_k[[K]][i] %in% DEGsRUV4_datasetB_all_k[[K]]
}
col<- c("blue", "violet")
vennDiagram(vennCounts(Counts_RUV4),
circle.col = col,
cex = c(1.6, 1.2, 1), lwd = 2)
}















Overlapping DEGs between unadjusted and RUV4-adjusted data sets
Compare DEGs from the unadjusted and RUV4-adjusted data set A
K = 23
##----- Define DEGs for the RUV4-adjusted data with K = 23 in data set A
DEGsRUV4_datasetA <- row.names(fit_ruv4_hk_datasetA_all_k.summary[[K]]$C)[fit_ruv4_hk_datasetA_all_k.summary[[K]]$C$F.p.BH < 0.05 & abs(fit_ruv4_hk_datasetA_all_k.summary[[K]]$C$b_X1) > 1]
##----- Look at the overalapping genes between unadjusted and RUV-4 adjusted data set A
allDEGs_datasetA <- c(DEGsUnadj_datasetA,
DEGsRUV4_datasetA_all_k[[K]])
## remove duplicated gene symbols:
allDEGs_datasetA <- allDEGs_datasetA[!duplicated(allDEGs_datasetA)]
## Draw a Venn diagram comparing DEGs for dataset A
Counts_datasetA <- matrix(0, nrow= length(allDEGs_datasetA), ncol=2)
row.names(Counts_datasetA)<- allDEGs_datasetA
colnames(Counts_datasetA)<- c("Unadj_A", "Ruv4_A_K_23")
for( i in 1:length(allDEGs_datasetA)) {
Counts_datasetA[i,1]<- allDEGs_datasetA[i] %in% DEGsUnadj_datasetA
Counts_datasetA[i,2]<- allDEGs_datasetA[i] %in% DEGsRUV4_datasetA_all_k[[K]]
}
col<- c("blue","darkgreen")
vennDiagram(vennCounts(Counts_datasetA),
circle.col=col,
cex=c(1.6, 1.2, 1), lwd=2)

Compare DEGs from the unadjusted and RUV4-adjusted data set B
K = 23
##----- Define DEGs for the RUV4-adjusted data with K = 23 in data set B
DEGsRUV4_datasetB <- row.names(fit_ruv4_hk_datasetB_all_k.summary[[K]]$C)[fit_ruv4_hk_datasetB_all_k.summary[[K]]$C$F.p.BH < 0.05 & abs(fit_ruv4_hk_datasetB_all_k.summary[[K]]$C$b_X1) > 1]
##----- Look at the overalapping genes between unadjusted and RUV-4 adjusted data set B
allDEGs_datasetB <- c(DEGsUnadj_datasetB,
DEGsRUV4_datasetB_all_k[[K]])
## remove duplicated gene symbols:
allDEGs_datasetB <- allDEGs_datasetB[!duplicated(allDEGs_datasetB)]
## Draw a Venn diagram comparing DEGs for dataset B
Counts_datasetB <- matrix(0, nrow = length(allDEGs_datasetB), ncol = 2)
row.names(Counts_datasetB) <- allDEGs_datasetB
colnames(Counts_datasetB) <- c("Unadj_B", "Ruv4_B_K_23")
for( i in 1:length(allDEGs_datasetB)) {
Counts_datasetB[i,1]<- allDEGs_datasetB[i] %in% DEGsUnadj_datasetB
Counts_datasetB[i,2]<- allDEGs_datasetB[i] %in% DEGsRUV4_datasetB_all_k[[K]]
}
col <- c("blue", "darkgreen")
vennDiagram(vennCounts(Counts_datasetB),
circle.col = col,
cex = c(1.6, 1.2, 1), lwd = 2)

---
title: "How to remove unwanted variation from transcriptomics data using RUV4"
author: Momeneh (Sepideh) Foroutan and Marie Trussart
output:
  github_document:
    toc: yes
  html_notebook:
    number_sections: yes
    toc: yes
    toc_float: yes
    code_folding: "hide"
    fig_caption: yes
---


# Data description
In this tutorial, we aim to obtain differentially expressed genes (DEGs) between two biological conditions in an integrated data. This integrated data set is generated by combining 10 microarray studies on control and TGFb-treated samples (for more information about the data sets, see paper [A Transcriptional Program for Detecting TGFβ-Induced EMT in Cancer](http://mcr.aacrjournals.org/content/15/5/619) by Foroutan M et al, 2017.\ 

First, we assess the presence of unwanted variation in the integrated data which combines differents studies and platforms. Second, we show how to use RUV4 to remove unwanted variation and detect DEGs comparing control to TGFb-treated samples. Finally, we assess whether or not running RUV4 method helps, where we include the analysis of unadjusted data.
```{r load-libs}
library(ruv)            ## for applying RUV method
library(limma)          ## for vennDiagram()
library(ggplot2)        ## for data visualisation
library(knitr)          ## for Kable()
```

The integrated data introduced above have been split into two data sets: dataset A and dataset B, each containing different studies, platforms and tissues. We will explore and normalise these two data sets separately in order to compare the results obtained by the normalisation method RUV4.

Read in the two integrated datasets.
```{r load-data}
datasetA<- read.table("expr_10data_datasetA.txt", header = T, sep = "\t")
datasetB<- read.table("expr_10data_datasetB.txt", header = T, sep = "\t")
```

Look at the each data sets A and B, then make a matrix for each these data sets where the row names are gene IDs.
```{r explore-data}

kable(datasetA[1:4,1:4])
kable(datasetA[1:4,1:4])
mA<-datasetA[,2:dim(datasetA)[2]]
mB<-datasetB[,2:dim(datasetB)[2]]
row.names(mA)<- datasetA[,1]
row.names(mB)<- datasetB[,1]
mA<- as.matrix(mA)
mB<- as.matrix(mB)
```

Look at the information related to each dataset including the name of the studies, types of platform, treatment, and tissue:
```{r load-info-study}
info_datasetA <- read.table("info_10data_datasetA.txt", sep="\t", header=T)
kable(info_datasetA)
info_datasetB <- read.table("info_10data_datasetB.txt", sep="\t", header=T)
kable(info_datasetB)
```

# Assessment of unwanted variation in the data
Here we perform some exploratory analyses on the integrated data sets A and B to assess the presence of unwanted variation in each dataset.

## RLE plot
We start by looking at the RLE plots in dataset A, coloured by study, platform and tissue. In a "good" RLE plot, the median of all boxplots (i.e. samples) are around zero, while high deviations from zero show the presence of unwanted variation. For more information about RLE plot see  paper [RLE Plots: Visualising Unwanted Variation in High Dimensional Data](https://arxiv.org/abs/1704.03590), by Luke C. Gandolfo, Terence P. Speed, 2017.\
In the current example, it's important to note that in some cases, different studies are confounded with different platforms and tissues, and therefore there is no way to identify how much of the unwanted variation come from each of these factors. Such situations must be avoided when designing an experiment, however, it often happens when we perform data integration. For the purpose of this tutorial, we only consider "study" as the source of unwanted variation.\

```{r rleplots-datasetA}
## Transpose the expression matrix so that we have genes in columns and dataset in rows
YA <- t(mA)
## Plot RLE coloured by study
ruv_rle(YA, info_datasetA$study, ylim = c(-5,5))
## Plot RLE coloured by platform
ruv_rle(YA, info_datasetA$platform, ylim = c(-5,5))
## Plot RLE coloured by platform
ruv_rle(YA, info_datasetA$tissue, ylim = c(-5,5))
```

Similarly, we look at the RLE plots in dataset B coloured by study, platform and tissue:
```{r rleplots-datasetB}
YB <- t(mB)
## Plot RLE coloured by study
ruv_rle(YB, info_datasetB$study, ylim = c(-5,5))
## Plot RLE coloured by platform
ruv_rle(YB, info_datasetB$platform, ylim = c(-5,5))
## Plot RLE coloured by tissue
ruv_rle(YB, info_datasetB$tissue, ylim = c(-5,5))
```

## PCA plot
In transcriptomics applications, one of the most utilized exploratory plots is the multi-dimensional scaling (MDS) plot or a principal component analysis (PCA) plot. To assess the presence of unwanted variation in each dataset, we use PCA plots to show similarities between samples measured in an unsupervised way. Ideally, samples should be clustered together according to the treatment (i.e. the biological factor of interest). Here, we see that samples are rather clustered by studies (i.e. unwanted variation) in both data sets A and B.\

```{r mds-plot-datasetA}
gg_additions <- list(aes(color = info_datasetA$study, 
                         shape = info_datasetA$treatment, 
                         size = 5, alpha = .7),
                     labs(color = "Study", shape = "Treatment"),
                     scale_size_identity(guide = "none"),
                     scale_alpha(guide = "none"),
                     theme(legend.text = element_text(size = 12),
                           legend.title = element_text(size = 16)),
                     guides(color = guide_legend(override.aes = list(size = 4)),
                            shape = guide_legend(override.aes = list(size = 4))))
options(repr.plot.width = 8, repr.plot.height = 6)
ruv_svdplot(YA) + gg_additions 
```

```{r mds-plot-datasetB}
gg_additions <- list(aes(color = info_datasetB$study,
                         shape = info_datasetB$treatment,
                         size = 5, alpha = .7),
                     labs(color = "Study",shape = "Treatment"),
                     scale_size_identity(guide = "none"),
                     scale_alpha(guide = "none"),
                     theme(legend.text = element_text(size = 12),
                           legend.title = element_text(size = 16)),
                     guides(color = guide_legend(override.aes = list(size = 4)),
                            shape = guide_legend(override.aes = list(size = 4))))
options(repr.plot.width=8, repr.plot.height=6)
ruv_svdplot(YB) + gg_additions #
```


# Remove batch effects using RUV-4
There are several RUV methods for removing unwanted variation in order to obtain DEGs, including RUV-2, RUV-4, RUV-inv and RUV-rinv. In general, RUV methods are dependent on negative control genes (genes which are not associated with the biological factor of interest) and replicate samples (if applicable). RUV-2 removes unwanted variation in two steps. RUV-4 was introduced after RUV-2 and has four steps. For RUV-4 we can select different values for dimension of unwanted variation (k), while for RUV-inv and RUV-rinv we don't need to change k as it is set to be the maximum value. RUV-inv is recommended when we have large number of control genes (~1000), while RUV-rinv is more appropriate with small number of control genes (~60).\

Selection of negative control genes is very important. Examples of negative control genes are the spike-in controls or the housekeeping (HK) genes. It is also possible to define empirical negative control genes using an iterative approach. However, it is only recomended if (i) the initial negative control genes are not very good or there are only a few of them; (ii) the beta seems to be very sparse. Therefore, the user needs to generate diagnostic plots to assess the performance of the initial analysis, and only if needed, use an iterative approach to define better control genes. For this tutorial, we use HK genes as negative control genes.


## RUV4
RUV4 can be run with different values of k in an attempt to find the optimal value. Note that although there is a function to estimate k, called getK(), this may not give the optimal value for k and is often recomended to be used when there is no other choice but to automate finding k (e.g. in simulations). 

We begin with the list of housekeeping (HK) genes as our negative controls.
```{r}
HKgenes <- read.table("HouseKeeping_genes_IDs.txt", header=T, sep="\t")
hk <- HKgenes$GeneID
ctrl <- colnames(YA) %in% hk
head(ctrl)
```

We also define our biological factor of interest (i.e. control versus TGFb treatment)
```{r}
## Take treatment as the biological factor of interest in data set A
groups_A <- factor(info_datasetA$treatment) 
gA <- cbind(as.numeric(groups_A))   ## 1 control, 2: TGFb

## Take treatment as the biological factor of interest in data set B
groups_B <- factor(info_datasetB$treatment) 
gB <- cbind(as.numeric(groups_B))   ## 1 control, 2: TGFb

```

### Unadjusted data
RUV4 with k = 0 performs no adjustment when obtaining DEGs, we do this first to see what would be the results if we did not adjust for unwanted variation.
```{r DE-unadjusted-A-B-using-HK}
# RUV4 with k = 0 for no adjustment
#----- In dataset A data
fit_unadj_hk_datasetA <- RUV4(YA, X = gA,
                          ctl = ctrl,
                          k = 0)
fit_unadj_hk_datasetA.summary <- ruv_summary(YA,
                                        fit_unadj_hk_datasetA,
                                        info_datasetA)

##----- In dataset B data
fit_unadj_hk_datasetB <- RUV4(YB, X = gB,
                          ctl = ctrl,
                          k = 0)
fit_unadj_hk_datasetB.summary <- ruv_summary(YB,
                                         fit_unadj_hk_datasetB,
                                         info_datasetB)

```

Look at the consistency between the results from data sets A and B when no adjustment is perfromed. Here, we look at the correlation between betahats obtained in the two data sets.

```{r}
##----- Unadjusted data sets
plot(fit_unadj_hk_datasetA$betahat, 
     fit_unadj_hk_datasetB$betahat,
     xlab = "Betahat dataset A",
     ylab = "Betahat dataset B",
     main = "Unadjusted",
     xlim = c(-3,3), cex = 0.3, ylim = c(-4,4))
corVal <- cor.test(fit_unadj_hk_datasetA$betahat, 
                   fit_unadj_hk_datasetB$betahat)$estimate
text(-3,3, pos = 4, paste("Correlation: ", round(corVal,2), sep = ""))

```


### RUV4-adjusted data with different values of k
``` {r ruv4-HK-differentKs}
## Select different values of k
ks <- c( 1, 2, 5, 6, 7, 8, 10, 11, 12, 15, 18, 20, 22, 23, 24)

## Define empty lists and use them in the for loop to store the results
fit_ruv4_hk_datasetA_all_k <- list()
fit_ruv4_hk_datasetA_all_k.summary <- list()
fit_ruv4_hk_datasetB_all_k <- list()
fit_ruv4_hk_datasetB_all_k.summary <- list()

for (K in ks){
  fit_ruv4_hk_datasetA_all_k[[K]] <- RUV4(YA, X = gA, 
                            ctl = ctrl, 
                            k = K, Z = 1, eta = NULL, 
                            fullW0 = NULL, inputcheck = TRUE)
  
  fit_ruv4_hk_datasetA_all_k.summary[[K]] <- ruv_summary(YA,
                                           fit_ruv4_hk_datasetA_all_k[[K]],
                                           info_datasetA)
  
  fit_ruv4_hk_datasetB_all_k[[K]] <- RUV4(YB, X = gB, 
                            ctl = ctrl, 
                            k = K,Z = 1, eta = NULL, 
                            fullW0 = NULL, inputcheck = TRUE)
  
  fit_ruv4_hk_datasetB_all_k.summary[[K]] <- ruv_summary(YB,
                                           fit_ruv4_hk_datasetB_all_k[[K]],
                                           info_datasetB)
  
  plot(fit_ruv4_hk_datasetA_all_k[[K]]$betahat,
     fit_ruv4_hk_datasetB_all_k[[K]]$betahat,
     xlab = "Betahat dataset A",
     ylab = "Betahat dataset B",
     main = paste("RUV4_K_", K, sep=""),
     xlim = c(-3,3), cex = 0.3, ylim = c(-4,4))
  #abline(fit_ruv4_emp_datasetB$betahat,fit_ruv4_emp_datasetA$betahat)
  corVal <- cor.test(fit_ruv4_hk_datasetA_all_k[[K]]$betahat,
                     fit_ruv4_hk_datasetB_all_k[[K]]$betahat)$estimate
  text(-3,3, pos=4, paste("Correlation: ", round(corVal,2),sep=""))
  
}
```

## P-value distributions for unadjusted and RUV4-adjusted data sets
Here we compare the results obtained from unajusted and RUV4- adjusted data using p-value distributions.

```{r}
## We select K = 23 
K = 23 
##----- In data A
ruv_hist(fit_unadj_hk_datasetA.summary) + ggtitle("Unadj_A")
ruv_hist(fit_ruv4_hk_datasetA_all_k.summary[[K]]) + ggtitle("RUV4_HK_A")

## ---- In data B
ruv_hist(fit_unadj_hk_datasetB.summary) + ggtitle("Unadj_B")
ruv_hist(fit_ruv4_hk_datasetB_all_k.summary[[K]]) + ggtitle("RUV4_HK_B")
```

## Overlapping DEGs between data sets A and B 
Finally, for each of the unadjusted and RUV4-adjusted data sets, we look at the number of overlapping differentially expressed genes (DEGs) between the two data sets A and B.\

First, we define DEGs as genes with adjusted p-value < 0.05 and |logFC| > 1 in data sets A and B which are unadjusted, RUV4-adjusted with K = 23, and RUV-4 adjusted with different values of K. In the next step, we look at the overlapping genes.

### DEGs in the unadjusted data sets
Venn diagram comparing DEGs obtained from the unadjusted data sets A and B.
```{r Venn-DEGs-unadj}
##------ Define DEGs in the unadjusted data set A
DEGsUnadj_datasetA <- row.names(fit_unadj_hk_datasetA.summary$C)[fit_unadj_hk_datasetA.summary$C$F.p.BH < 0.05 & abs(fit_unadj_hk_datasetA.summary$C$b_X1) > 1]

##------ Define DEGs in the unadjusted data set B
DEGsUnadj_datasetB <- row.names(fit_unadj_hk_datasetB.summary$C)[fit_unadj_hk_datasetB.summary$C$F.p.BH < 0.05 & abs(fit_unadj_hk_datasetB.summary$C$b_X1) > 1]

##------ Explore how many genes are overlapping between data sets A and B
allDEGs_unadj <- c(DEGsUnadj_datasetA, 
                    DEGsUnadj_datasetB)

## remove duplicated gene symbols:
allDEGs_unadj <- allDEGs_unadj[!duplicated(allDEGs_unadj)]  

## Draw a Venn diagram comparing DEGs for RUVinv
Counts_unadj <- matrix(0, nrow = length(allDEGs_unadj), ncol = 2)
row.names(Counts_unadj)<- allDEGs_unadj
colnames(Counts_unadj)<- c("Unadj_A","Unadj_B")

for( i in 1:length(allDEGs_unadj)) {
  Counts_unadj[i,1]<- allDEGs_unadj[i] %in% DEGsUnadj_datasetA
  Counts_unadj[i,2]<- allDEGs_unadj[i] %in% DEGsUnadj_datasetB
}

col <- c("blue", "violet")
vennDiagram(vennCounts(Counts_unadj), 
            circle.col = col, 
            cex = c(1.6, 1.2, 1), lwd=2)

```


### DEGs in the RUV4-adjusted data sets for different values of k
Venn diagram comparing the RUV4-adjusted data sets A and B using different ks.
```{r Venn-DEGs-RUV4}

##---- Define DEGs for different values of k in data set A
DEGsRUV4_datasetA_all_k <- list()
for (K in ks){
  DEGsRUV4_datasetA_all_k[[K]] <- row.names(fit_ruv4_hk_datasetA_all_k.summary[[K]]$C)[fit_ruv4_hk_datasetA_all_k.summary[[K]]$C$F.p.BH < 0.05 & abs(fit_ruv4_hk_datasetA_all_k.summary[[K]]$C$b_X1) > 1]  
}

##---- Define DEGs for different values of k in data set B
DEGsRUV4_datasetB_all_k <- list()
for (K in ks){
  DEGsRUV4_datasetB_all_k[[K]] <- row.names(fit_ruv4_hk_datasetB_all_k.summary[[K]]$C)[fit_ruv4_hk_datasetB_all_k.summary[[K]]$C$F.p.BH < 0.05 & abs(fit_ruv4_hk_datasetB_all_k.summary[[K]]$C$b_X1) > 1]  
}

##---- Generate Venn diagrams, looking at the overlapping genes 
##     between data sets A and B for different values of k

allDEGs_RUV4_all_k <- list()

for (K in ks){
  allDEGs_RUV4_all_k[[K]] <- c(DEGsRUV4_datasetA_all_k[[K]],
                               DEGsRUV4_datasetB_all_k[[K]])
  ## remove duplicated gene symbols:
  allDEGs_RUV4_all_k[[K]] <- allDEGs_RUV4_all_k[[K]][!duplicated(allDEGs_RUV4_all_k[[K]])]  
  ## Draw a Venn diagram comparing DEGs for RUV4
  Counts_RUV4<- matrix(0, nrow = length(allDEGs_RUV4_all_k[[K]]), ncol=2)
  row.names(Counts_RUV4) <- allDEGs_RUV4_all_k[[K]]
  colnames(Counts_RUV4) <- c(paste("RUV4_A_K_", K, sep=""), paste("RUV4_B_K_",K,sep=""))

  for( i in 1:length(allDEGs_RUV4_all_k[[K]])) {
    Counts_RUV4[i,1]<- allDEGs_RUV4_all_k[[K]][i] %in% DEGsRUV4_datasetA_all_k[[K]]
    Counts_RUV4[i,2]<- allDEGs_RUV4_all_k[[K]][i] %in% DEGsRUV4_datasetB_all_k[[K]]
  }

  col<- c("blue", "violet")
  vennDiagram(vennCounts(Counts_RUV4), 
            circle.col = col, 
            cex = c(1.6, 1.2, 1), lwd = 2)
}

```

## Overlapping DEGs between unadjusted and RUV4-adjusted data sets
### Compare DEGs from the unadjusted and RUV4-adjusted data set A
```{r Venn-DEGs-datasetA}
K = 23
##----- Define DEGs for the RUV4-adjusted data with K = 23 in data set A
DEGsRUV4_datasetA <- row.names(fit_ruv4_hk_datasetA_all_k.summary[[K]]$C)[fit_ruv4_hk_datasetA_all_k.summary[[K]]$C$F.p.BH < 0.05 & abs(fit_ruv4_hk_datasetA_all_k.summary[[K]]$C$b_X1) > 1]  

##----- Look at the overalapping genes between unadjusted and RUV-4 adjusted data set A
allDEGs_datasetA <- c(DEGsUnadj_datasetA,  
                      DEGsRUV4_datasetA_all_k[[K]])

## remove duplicated gene symbols:
allDEGs_datasetA <- allDEGs_datasetA[!duplicated(allDEGs_datasetA)]  

## Draw a Venn diagram comparing DEGs for dataset A
Counts_datasetA <- matrix(0, nrow= length(allDEGs_datasetA), ncol=2)
row.names(Counts_datasetA)<- allDEGs_datasetA
colnames(Counts_datasetA)<- c("Unadj_A", "Ruv4_A_K_23")

for( i in 1:length(allDEGs_datasetA)) {
  Counts_datasetA[i,1]<- allDEGs_datasetA[i] %in% DEGsUnadj_datasetA
  Counts_datasetA[i,2]<- allDEGs_datasetA[i] %in% DEGsRUV4_datasetA_all_k[[K]]
}

col<- c("blue","darkgreen")
vennDiagram(vennCounts(Counts_datasetA), 
            circle.col=col, 
            cex=c(1.6, 1.2, 1), lwd=2)

```

### Compare DEGs from the unadjusted and RUV4-adjusted data set B 
```{r Venn-DEGs-datasetB}
K = 23
##----- Define DEGs for the RUV4-adjusted data with K = 23 in data set B
DEGsRUV4_datasetB <- row.names(fit_ruv4_hk_datasetB_all_k.summary[[K]]$C)[fit_ruv4_hk_datasetB_all_k.summary[[K]]$C$F.p.BH < 0.05 & abs(fit_ruv4_hk_datasetB_all_k.summary[[K]]$C$b_X1) > 1]  

##----- Look at the overalapping genes between unadjusted and RUV-4 adjusted data set B
allDEGs_datasetB <- c(DEGsUnadj_datasetB,
                      DEGsRUV4_datasetB_all_k[[K]])

## remove duplicated gene symbols:
allDEGs_datasetB <- allDEGs_datasetB[!duplicated(allDEGs_datasetB)] 

## Draw a Venn diagram comparing DEGs for dataset B
Counts_datasetB <- matrix(0, nrow = length(allDEGs_datasetB), ncol = 2)
row.names(Counts_datasetB) <- allDEGs_datasetB
colnames(Counts_datasetB) <- c("Unadj_B", "Ruv4_B_K_23")

for( i in 1:length(allDEGs_datasetB)) {
  Counts_datasetB[i,1]<- allDEGs_datasetB[i] %in% DEGsUnadj_datasetB
  Counts_datasetB[i,2]<- allDEGs_datasetB[i] %in% DEGsRUV4_datasetB_all_k[[K]]
}

col <- c("blue", "darkgreen")
vennDiagram(vennCounts(Counts_datasetB), 
            circle.col = col, 
            cex = c(1.6, 1.2, 1), lwd = 2)
```






